Day 06-Feature Engineering -- 2. Categorical Encoding(5)

2.1 One hot encoding
2.2 Count and Frequency encoding
2.3 Target encoding / Mean encoding
2.4 Ordinal encoding
2.5 Weight of Evidence
2.6 Rare label encoding
2.7 Helmert encoding
2.8 Probability Ratio Encoding
2.9 Label encoding
2.10 Feature hashing
2.11 Binary encoding & BaseN encoding
2.12 Sum Encoder (Deviation Encoding or Effect Encoding)
2.13 Backward Difference
2.14 Polynomial
2.15 Leave One Out
2.16 James-Stein
2.17 M-estimator
2.18 CatBoost encoding

We will use the following data frame, which has two independent variables, or features, and one label (target), for a total of ten records.
| Rec-No | Temperature | Color | Target |
| ------ | ----------- | ----- | ------ |
| 0 | Hot | Red | 1 |
| 1 | Cold | Yellow | 1 |
| 2 | Very Hot | Blue | 1 |
| 3 | Warm | Blue | 0 |
| 4 | Hot | Red | 1 |
| 5 | Warm | Yellow | 0 |
| 6 | Warm | Red | 1 |
| 7 | Hot | Yellow | 0 |
| 8 | Hot | Yellow | 1 |
| 9 | Cold | Yellow | 1 |
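
The code in the sections below assumes this data frame already exists as `df`. A minimal sketch to build it with pandas:

import pandas as pd

# Build the ten-record sample data frame shown above
df = pd.DataFrame({
    'Temperature': ['Hot', 'Cold', 'Very Hot', 'Warm', 'Hot',
                    'Warm', 'Warm', 'Hot', 'Hot', 'Cold'],
    'Color': ['Red', 'Yellow', 'Blue', 'Blue', 'Red',
              'Yellow', 'Red', 'Yellow', 'Yellow', 'Yellow'],
    'Target': [1, 1, 1, 0, 1, 0, 1, 0, 1, 1],
})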

2.12 Sum Encoder (Deviation Encoding or Effect Encoding)

Sum encoding compares the mean of the target for one category of a variable against the overall mean across all categories.

import category_encoders as ce

# Sum (deviation) encode the Temperature column
Sum_encoder = ce.SumEncoder(cols=['Temperature'])
df_se = Sum_encoder.fit_transform(df['Temperature'])
df_se.columns = ['se_' + str(i) for i in df_se.columns]  # prefix the new columns
df = pd.concat([df, df_se], axis=1)
df

| / | Temperature | Color | Target | se_intercept | se_Temperature_0 | se_Temperature_1 | se_Temperature_2 |
| - | ----------- | ----- | ------ | ------------ | ---------------- | ---------------- | ---------------- |
| 0 | Hot | Red | 1 | 1 | 1.0 | 0.0 | 0.0 |
| 1 | Cold | Yellow | 1 | 1 | 0.0 | 1.0 | 0.0 |
| 2 | Very Hot | Blue | 1 | 1 | 0.0 | 0.0 | 1.0 |
| 3 | Warm | Blue | 0 | 1 | -1.0 | -1.0 | -1.0 |
| 4 | Hot | Red | 1 | 1 | 1.0 | 0.0 | 0.0 |
| 5 | Warm | Yellow | 0 | 1 | -1.0 | -1.0 | -1.0 |
| 6 | Warm | Red | 1 | 1 | -1.0 | -1.0 | -1.0 |
| 7 | Hot | Yellow | 0 | 1 | 1.0 | 0.0 | 0.0 |
| 8 | Hot | Yellow | 1 | 1 | 1.0 | 0.0 | 0.0 |
| 9 | Cold | Yellow | 1 | 1 | 0.0 | 1.0 | 0.0 |
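
To see the quantity sum coding is built around, we can compare each temperature level's target mean against the overall mean directly (a quick pandas check; the plain mean of the ten targets, 0.7, serves as the grand mean here):

# Deviation of each category's target mean from the overall mean --
# the comparison that sum (deviation/effect) coding exposes to a
# downstream linear model
grand_mean = df['Target'].mean()                         # 0.7
level_means = df.groupby('Temperature')['Target'].mean()
print(level_means - grand_mean)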

2.13 Backward Difference

In backward difference encoding, the mean of the target for one category of a variable is compared with the mean for the preceding category. This method can be particularly beneficial for nominal or ordinal variables.

# Backward difference encode the Temperature column
ce_backward = ce.BackwardDifferenceEncoder(cols=['Temperature'])
df_ce = ce_backward.fit_transform(df['Temperature'])
df_ce.columns = ['bk_' + str(i) for i in df_ce.columns]  # prefix the new columns
df = pd.concat([df, df_ce], axis=1)
df

| / | Temperature | Color | Target | bk_intercept | bk_Temperature_0 | bk_Temperature_1 | bk_Temperature_2 |
| - | ----------- | ----- | ------ | ------------ | ---------------- | ---------------- | ---------------- |
| 0 | Hot | Red | 1 | 1 | -0.75 | -0.5 | -0.25 |
| 1 | Cold | Yellow | 1 | 1 | 0.25 | -0.5 | -0.25 |
| 2 | Very Hot | Blue | 1 | 1 | 0.25 | 0.5 | -0.25 |
| 3 | Warm | Blue | 0 | 1 | 0.25 | 0.5 | 0.75 |
| 4 | Hot | Red | 1 | 1 | -0.75 | -0.5 | -0.25 |
| 5 | Warm | Yellow | 0 | 1 | 0.25 | 0.5 | 0.75 |
| 6 | Warm | Red | 1 | 1 | 0.25 | 0.5 | 0.75 |
| 7 | Hot | Yellow | 0 | 1 | -0.75 | -0.5 | -0.25 |
| 8 | Hot | Yellow | 1 | 1 | -0.75 | -0.5 | -0.25 |
| 9 | Cold | Yellow | 1 | 1 | 0.25 | -0.5 | -0.25 |
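
The bk_* columns are built around the differences between the target means of successive categories. A small sketch of those differences (the level ordering is an assumption here, taken as order of first appearance in the data):

# Differences between target means of adjacent Temperature levels --
# the comparisons backward difference coding is designed to estimate.
# The ordering below is an assumption (order of first appearance).
order = ['Hot', 'Cold', 'Very Hot', 'Warm']
level_means = df.groupby('Temperature')['Target'].mean().reindex(order)
print(level_means.diff().dropna())   # mean(level k) - mean(level k-1)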

2.14 Polynomial

Polynomial coding is a less commonly used method, yet it is among the methods that best capture the information in a variable. The goal of polynomial coding is to identify linear and non-linear trends in the relationship between the dependent and independent variables, looking for linear, quadratic, and cubic trends in the categorical variable.

# Polynomial (orthogonal contrast) encode the Temperature column
ce_poly = ce.PolynomialEncoder(cols=['Temperature'])
dfp = ce_poly.fit_transform(df['Temperature'])
dfp.columns = ['poly_' + str(i) for i in dfp.columns]  # prefix the new columns
df = pd.concat([df, dfp], axis=1)
df

| / | Temperature | Color | Target | poly_intercept | poly_Temperature_0 | poly_Temperature_1 | poly_Temperature_2 |
| - | ----------- | ----- | ------ | -------------- | ------------------ | ------------------ | ------------------ |
| 0 | Hot | Red | 1 | 1 | -0.670820 | 0.5 | -0.223607 |
| 1 | Cold | Yellow | 1 | 1 | -0.223607 | -0.5 | 0.670820 |
| 2 | Very Hot | Blue | 1 | 1 | 0.223607 | -0.5 | -0.670820 |
| 3 | Warm | Blue | 0 | 1 | 0.670820 | 0.5 | 0.223607 |
| 4 | Hot | Red | 1 | 1 | -0.670820 | 0.5 | -0.223607 |
| 5 | Warm | Yellow | 0 | 1 | 0.670820 | 0.5 | 0.223607 |
| 6 | Warm | Red | 1 | 1 | 0.670820 | 0.5 | 0.223607 |
| 7 | Hot | Yellow | 0 | 1 | -0.670820 | 0.5 | -0.223607 |
| 8 | Hot | Yellow | 1 | 1 | -0.670820 | 0.5 | -0.223607 |
| 9 | Cold | Yellow | 1 | 1 | -0.223607 | -0.5 | 0.670820 |
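
The poly_* values are orthonormal polynomial contrasts for four levels. A minimal numpy sketch reproduces them (up to sign) by QR-decomposing a Vandermonde matrix:

import numpy as np

# Orthonormal polynomial contrasts for 4 levels: QR-decompose the
# Vandermonde matrix with columns 1, x, x^2, x^3, then drop the
# constant column; what remains are the linear, quadratic and cubic
# contrasts (matching the poly_Temperature_* values up to sign)
levels = np.arange(4)
V = np.vander(levels, 4, increasing=True)
Q, _ = np.linalg.qr(V)
print(Q[:, 1:])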

2.15 Leave One Out

Similar to target encoding, but when computing the mean of the target for each category, we exclude the target of the current row. This reduces the effect of outliers.

# Leave-one-out encoding needs the target, so split features and label
X = df.drop(['Target'], axis=1)
y = df['Target']
ce_leave = ce.LeaveOneOutEncoder(cols=['Temperature'])
dfl = ce_leave.fit_transform(X, y)
dfl
| / | Temperature | Color |
| - | ----------- | ----- |
| 0 | 0.666667 | Red |
| 1 | 1.000000 | Yellow |
| 2 | 0.700000 | Blue |
| 3 | 0.500000 | Blue |
| 4 | 0.666667 | Red |
| 5 | 0.500000 | Yellow |
| 6 | 0.000000 | Red |
| 7 | 1.000000 | Yellow |
| 8 | 0.666667 | Yellow |
| 9 | 1.000000 | Yellow |
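
A manual check of the first row: row 0 is Hot with Target 1, and the other Hot rows (4, 7, 8) have targets 1, 0, 1, whose mean is the 0.666667 shown above:

# Leave-one-out value for row 0: mean of Target over the other 'Hot' rows
other_hot = df.loc[(df['Temperature'] == 'Hot') & (df.index != 0), 'Target']
print(other_hot.mean())   # 0.666667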

2.16 James-Stein Encoding

James-Stein encoding is a target-based encoder. It is similar to target encoding, but the values it produces are shrunk toward the overall (population) mean of the target, so each encoded value is a weighted sum of the individual category's target mean and the overall mean.

The James-Stein estimator has one practical limitation: it was designed for normal distributions, so it is not well suited to classification models. To overcome this, we can convert a binary target to a log-odds ratio, or use a beta distribution instead.

# James-Stein encode Temperature, shrinking toward the overall target mean
ce_James = ce.JamesSteinEncoder(cols=['Temperature'])
dfj = ce_James.fit_transform(X, y)
dfj
| / | Temperature | Color |
| - | ----------- | ----- |
| 0 | 0.741379 | Red |
| 1 | 1.000000 | Yellow |
| 2 | 1.000000 | Blue |
| 3 | 0.405229 | Blue |
| 4 | 0.741379 | Red |
| 5 | 0.405229 | Yellow |
| 6 | 0.405229 | Red |
| 7 | 0.741379 | Yellow |
| 8 | 0.741379 | Yellow |
| 9 | 1.000000 | Yellow |
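
The blend itself is easy to write down. The sketch below shows its general shape with an illustrative shrinkage weight B; it does not reproduce the exact variance-based weight that category_encoders computes:

# James-Stein-style blend: shrink the category mean toward the overall
# mean. B here is illustrative only; the library derives it from the
# variances of the category and overall means.
def js_blend(cat_mean, overall_mean, B):
    return (1 - B) * cat_mean + B * overall_mean

print(js_blend(0.75, 0.7, 0.2))   # the 'Hot' mean with an assumed B = 0.2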

2.17 M-estimator

The M-estimate encoder is a simpler version of the target encoder, similar to the James-Stein encoder, but it adjusts each category's target mean toward the overall mean using an additional parameter m, whose default value is 1.

# M-estimate encode Temperature (additive smoothing with m = 1 by default)
ce_M_estimator = ce.MEstimateEncoder(cols=['Temperature'])
dfM = ce_M_estimator.fit_transform(X, y)
dfM
| / | Temperature | Color |
| - | ----------- | ----- |
| 0 | 0.740 | Red |
| 1 | 0.900 | Yellow |
| 2 | 0.850 | Blue |
| 3 | 0.425 | Blue |
| 4 | 0.740 | Red |
| 5 | 0.425 | Yellow |
| 6 | 0.425 | Red |
| 7 | 0.740 | Yellow |
| 8 | 0.740 | Yellow |
| 9 | 0.900 | Yellow |
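
With m = 1, these values follow directly from the smoothing formula (n_i * mean_i + m * overall_mean) / (n_i + m). A quick check reproduces the 0.740 for Hot:

# M-estimate for 'Hot': (sum of targets + m * overall mean) / (count + m)
m = 1
overall_mean = df['Target'].mean()            # 0.7
hot = df.loc[df['Temperature'] == 'Hot', 'Target']
print((hot.sum() + m * overall_mean) / (hot.count() + m))   # 0.74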
